Principal Component Analysis (PCA) + Kernel PCA#

PCA is one of the most useful “first moves” in data science:

  • compress data while keeping what varies the most

  • denoise (sometimes) by dropping low-variance directions

  • visualize high-dimensional data in 2D/3D

  • understand how ideas like eigenvectors show up in real ML

This notebook mirrors the style of the supervised-learning notebooks: intuition first, then the math, then a from-scratch implementation + visuals.


Learning goals#

By the end you should be able to:

  • explain PCA in plain language (“rotate the camera to the widest view”)

  • derive PCA from variance maximization and the covariance matrix

  • implement PCA from scratch using NumPy (center → covariance → eigendecompose → project)

  • understand Kernel PCA and the kernel trick at a high level

  • interpret explained variance and know when PCA can mislead you

Notation (quick)#

  • Data matrix: \(X \in \mathbb{R}^{n\times d}\) (rows = samples, columns = features)

  • Centered data: \(X_c = X - \mathbf{1}\mu^T\) where \(\mu\) is the feature-wise mean

  • Covariance matrix: \(\Sigma = \frac{1}{n-1} X_c^T X_c \in \mathbb{R}^{d\times d}\)

  • Principal axes (directions): columns of \(V\) (or rows of components_)

  • Scores (coordinates in PC space): \(Z = X_c V_k\)


Table of contents#

  1. Intuition (very simple)

  2. Statistical & linear algebra explanation

  3. Step-by-step algorithm (from scratch)

  4. Kernel PCA extension (why + how)

  5. Visual explanations (Plotly)

  6. When PCA works / fails

  7. Comparison table (PCA vs Kernel PCA vs ICA)

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from dataclasses import dataclass

from plotly.subplots import make_subplots

from sklearn.datasets import make_moons
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import StandardScaler


pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Intuition (very simple)#

Explain PCA like you’re 10:

Imagine you have a handful of dots on the floor and you want to take a picture that shows the dots as “spread out” as possible.

Metaphor 1: “rotate the camera to see the widest view”#

  • If you point your camera the wrong way, the dots might look squished.

  • If you rotate the camera to the best angle, the dots look most spread out.

PCA is the math version of:

“Rotate the axes until the data looks widest in one direction.”

That “widest direction” is the first principal component.

Metaphor 2: “shadow on the wall”#

Hold a 3D object in front of a lamp. The object casts a shadow on the wall.

  • Different angles give different shadows.

  • PCA chooses the angle where the shadow is most spread out.

Then PCA can also choose a second direction (at a right angle to the first) whose shadow is the next most spread out.

What PCA is not#

  • It doesn’t find “the best features” in a semantic sense.

  • It finds directions of maximum variance (biggest spread), which is a statistical idea.

2) Statistical & linear algebra explanation#

2.1 Covariance matrix (what PCA is built from)#

After centering, the covariance matrix is:

\[ \Sigma = \frac{1}{n-1} X_c^T X_c. \]

  • Diagonal entries: variances of each feature

  • Off-diagonal entries: how two features vary together

PCA looks for directions (unit vectors) \(w\) where the projected variance is large.
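Quick sanity check (a minimal sketch, not part of the notebook's pipeline): the manual centered-data formula for \(\Sigma\) matches NumPy's built-in `np.cov`, which also uses the \(n-1\) denominator by default.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Manual covariance: center, then (1/(n-1)) X_c^T X_c
Xc = X - X.mean(axis=0)
Sigma_manual = (Xc.T @ Xc) / (len(X) - 1)

# np.cov treats rows as variables by default, so pass rowvar=False
Sigma_np = np.cov(X, rowvar=False)

print(np.allclose(Sigma_manual, Sigma_np))  # True
```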

2.2 Variance maximization → eigenvectors#

Project data onto a unit direction \(w \in \mathbb{R}^d\):

\[ z = X_c w. \]

The variance of \(z\) is:

\[ \mathrm{Var}(z) = \mathrm{Var}(X_c w) = w^T \Sigma w. \]

PCA’s first component solves:

\[ \max_{\|w\|=1} \; w^T \Sigma w. \]

Using a Lagrange multiplier (constraint \(\|w\|^2 = 1\)), you get:

\[ \Sigma w = \lambda w. \]

So:

  • eigenvectors of \(\Sigma\) are candidate directions

  • eigenvalues tell you how much variance is along that direction

The largest eigenvalue’s eigenvector is the first principal component.
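We can check this claim numerically (a small sketch on synthetic data): the projected variance \(w^T \Sigma w\) of the top eigenvector equals the largest eigenvalue, and no random unit direction beats it.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / (len(X) - 1)

eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending eigenvalues
w_best = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue

def proj_var(w):
    # variance of the projection Xc @ w for a unit vector w
    return float(w @ Sigma @ w)

# Sample many random unit directions and compute their projected variances
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
rand_vars = np.einsum("ij,jk,ik->i", dirs, Sigma, dirs)

print(np.isclose(proj_var(w_best), eigvals[-1]))      # True
print(rand_vars.max() <= proj_var(w_best) + 1e-9)     # True
```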

2.3 Orthogonality (why PCs are at right angles)#

The covariance matrix \(\Sigma\) is symmetric, so it has an orthonormal eigenbasis:

  • eigenvectors are orthogonal (perpendicular)

  • we can choose them to have unit length

This gives you a clean story:

  • PC1 captures the most variance

  • PC2 captures the most remaining variance subject to being orthogonal to PC1

  • and so on

2.4 Relationship to SVD (another way to compute PCA)#

Take the SVD of the centered data:

\[ X_c = U S V^T. \]

Then:

\[ \Sigma = \frac{1}{n-1} X_c^T X_c = \frac{1}{n-1} V S^2 V^T. \]

So:

  • the columns of \(V\) are the PCA directions (principal axes)

  • the eigenvalues of \(\Sigma\) are \(\lambda_i = \frac{s_i^2}{n-1}\), where \(s_i\) are the singular values (the diagonal entries of \(S\))

In practice, SVD is often the numerically stable way to compute PCA.
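A quick numerical check of the equivalence (a minimal sketch): eigenvalues of the covariance matrix match squared singular values of the centered data divided by \(n-1\).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(250, 5))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix (descending order)
Sigma = (Xc.T @ Xc) / (len(X) - 1)
eigvals = np.linalg.eigvalsh(Sigma)[::-1]  # eigvalsh returns ascending order

# Route 2: SVD of the centered data (singular values come out descending)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals_svd = S**2 / (len(X) - 1)

print(np.allclose(eigvals, eigvals_svd))  # True
```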

3) Step-by-step algorithm (from scratch)#

Given \(X \in \mathbb{R}^{n\times d}\):

  1. Center the data: \(X_c = X - \mu\)

  2. Compute covariance: \(\Sigma = \frac{1}{n-1} X_c^T X_c\)

  3. Eigen-decompose: \(\Sigma v_i = \lambda_i v_i\)

  4. Sort eigenvalues descending, keep the top \(k\)

  5. Project: \(Z = X_c V_k\)

Where \(V_k\) is the matrix of the top-\(k\) eigenvectors.

Below we’ll implement PCA in NumPy and sanity-check it against sklearn.

# A 2D dataset with strong correlation (nice for PCA visuals)
n = 350
Z = rng.normal(size=(n, 2))

A = np.array(
    [
        [3.0, 1.2],
        [0.0, 0.7],
    ]
)
X2 = Z @ A.T
X2 = X2 + np.array([2.0, -1.0])

fig = px.scatter(
    x=X2[:, 0],
    y=X2[:, 1],
    title="Correlated 2D data (we'll find the best rotation)",
    labels={"x": "x1", "y": "x2"},
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
@dataclass
class PCAFit:
    mean_: np.ndarray
    components_: np.ndarray  # shape: (k, d)
    explained_variance_: np.ndarray  # shape: (k,)
    explained_variance_ratio_: np.ndarray  # shape: (k,)

    def transform(self, X: np.ndarray) -> np.ndarray:
        Xc = X - self.mean_
        return Xc @ self.components_.T

    def inverse_transform(self, Z: np.ndarray) -> np.ndarray:
        return Z @ self.components_ + self.mean_


def pca_fit(X: np.ndarray, n_components: int | None = None) -> PCAFit:
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape

    mean = X.mean(axis=0)
    Xc = X - mean

    cov = (Xc.T @ Xc) / (n_samples - 1)

    # Symmetric matrix => use eigh (stable, returns ascending eigenvalues)
    eigvals, eigvecs = np.linalg.eigh(cov)

    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]

    total_var = eigvals.sum()

    if n_components is None:
        k = n_features
    else:
        k = int(n_components)

    components = eigvecs[:, :k].T
    explained_variance = eigvals[:k]
    explained_variance_ratio = explained_variance / total_var

    return PCAFit(
        mean_=mean,
        components_=components,
        explained_variance_=explained_variance,
        explained_variance_ratio_=explained_variance_ratio,
    )


pca2 = pca_fit(X2, n_components=2)
pca2.explained_variance_ratio_
array([0.9705, 0.0295])
sk_pca2 = PCA(n_components=2)
Z_sklearn = sk_pca2.fit_transform(X2)

print("from scratch explained variance ratio:", pca2.explained_variance_ratio_)
print("sklearn explained variance ratio:     ", sk_pca2.explained_variance_ratio_)

# Compare directions up to sign (PCA axes can flip sign and still be 'the same')
print()
print("from scratch components_ (rows):")
print(pca2.components_)
print()
print("sklearn components_ (rows):")
print(sk_pca2.components_)
from scratch explained variance ratio: [0.9705 0.0295]
sklearn explained variance ratio:      [0.9705 0.0295]

from scratch components_ (rows):
[[-0.9954 -0.0953]
 [ 0.0953 -0.9954]]

sklearn components_ (rows):
[[ 0.9954  0.0953]
 [-0.0953  0.9954]]

4) Kernel PCA extension (non-linear structure)#

4.1 Motivation: when linear PCA is the wrong shape#

Linear PCA can only:

  • rotate your coordinate system

  • then optionally drop dimensions

So if your data lives on a curved shape (like two moons), linear PCA can’t “unbend” it.

Kernel PCA keeps the PCA idea (maximize variance) but does it in a feature space where the data may become more linear.

4.2 Kernel trick intuition#

PCA needs dot products like \(\langle x, z \rangle\).

Kernel trick idea:

  • pretend we mapped points into a huge feature space: \(\phi(x)\)

  • but instead of computing \(\phi(x)\) explicitly, we compute:

\[ k(x, z) = \langle \phi(x), \phi(z) \rangle. \]

Common kernels:

  • RBF (Gaussian): \(k(x,z)=\exp(-\gamma\|x-z\|^2)\)

  • Polynomial: \(k(x,z)=(x^T z + c)^p\)

4.3 Mathematical formulation (high level)#

Let \(\Phi\) be the centered feature-space data matrix (rows are \(\phi(x_i)\)). Kernel PCA avoids forming \(\Phi\) and works with the kernel matrix:

\[ K_{ij} = k(x_i, x_j). \]

We must center it (equivalent to centering in feature space):

\[ K_c = K - \mathbf{J}K - K\mathbf{J} + \mathbf{J}K\mathbf{J}, \quad \text{where } \mathbf{J} = \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T. \]

Then we eigendecompose:

\[ K_c \alpha = \lambda \alpha. \]

A point \(x\) is projected using kernel similarities to training points:

\[ z_m(x) = \sum_{i=1}^n \alpha_{m,i}\, k_c(x_i, x). \]

(Details vary by normalization convention, but the core idea is: projection = weighted sum of kernel evaluations.)
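To make the recipe concrete, here is a minimal from-scratch Kernel PCA sketch, sanity-checked against sklearn's KernelPCA on the training data. One assumption here (matching sklearn's convention) is that the training embedding scales each unit-norm eigenvector of \(K_c\) by \(\sqrt{\lambda}\); components are compared up to sign, since eigenvector signs are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_moons(n_samples=100, noise=0.1, random_state=0)
gamma, k = 5.0, 2
n = len(X)

# 1) Kernel matrix
K = rbf_kernel(X, X, gamma=gamma)

# 2) Double-center K (equivalent to centering in feature space)
J = np.full((n, n), 1.0 / n)
Kc = K - J @ K - K @ J + J @ K @ J

# 3) Eigendecompose and keep the top-k eigenpairs
eigvals, eigvecs = np.linalg.eigh(Kc)  # ascending order
eigvals, eigvecs = eigvals[::-1][:k], eigvecs[:, ::-1][:, :k]

# 4) Training embedding: unit-norm eigenvectors scaled by sqrt(eigenvalue)
Z_scratch = eigvecs * np.sqrt(eigvals)

Z_sklearn = KernelPCA(n_components=k, kernel="rbf", gamma=gamma).fit_transform(X)

# Compare up to per-component sign flips
mismatch = [min(np.abs(Z_scratch[:, j] - Z_sklearn[:, j]).max(),
                np.abs(Z_scratch[:, j] + Z_sklearn[:, j]).max())
            for j in range(k)]
print(max(mismatch) < 1e-6)  # True
```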

4.4 Linear PCA vs Kernel PCA (conceptually)#

  • Linear PCA: pick a straight line / plane that captures the most spread.

  • Kernel PCA: first “bend” space using a kernel (implicitly), then do PCA there.

5) Visual explanations (Plotly)#

We’ll build four interactive visuals:

  1. Original data vs principal axes

  2. Explained variance bar chart

  3. 2D → 1D projection animation

  4. Linear vs Kernel PCA comparison

# 5.1 Original data + principal axes (PC1, PC2)
mu = pca2.mean_
pc1, pc2 = pca2.components_[0], pca2.components_[1]

# Scale arrows by a few standard deviations along each PC
s1 = 3.0 * float(np.sqrt(pca2.explained_variance_[0]))
s2 = 3.0 * float(np.sqrt(pca2.explained_variance_[1]))

p1a, p1b = mu - s1 * pc1, mu + s1 * pc1
p2a, p2b = mu - s2 * pc2, mu + s2 * pc2

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=X2[:, 0],
        y=X2[:, 1],
        mode="markers",
        marker=dict(size=7, opacity=0.75, line=dict(width=0.5, color="white")),
        name="data",
    )
)
fig.add_trace(go.Scatter(x=[p1a[0], p1b[0]], y=[p1a[1], p1b[1]], mode="lines", name="PC1"))
fig.add_trace(go.Scatter(x=[p2a[0], p2b[0]], y=[p2a[1], p2b[1]], mode="lines", name="PC2"))
fig.add_trace(
    go.Scatter(
        x=[mu[0]],
        y=[mu[1]],
        mode="markers",
        marker=dict(size=10, symbol="x", color="black"),
        name="mean",
    )
)
fig.update_layout(
    title="PCA on 2D data: principal axes are the best 'camera rotation'",
    xaxis_title="x1",
    yaxis_title="x2",
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
# 5.2 Explained variance (a slightly higher-dimensional example)

n = 600
latent = rng.normal(size=(n, 2))
latent[:, 0] *= 3.0  # big variance direction
latent[:, 1] *= 1.0

W = rng.normal(size=(2, 6))
X6 = latent @ W + 0.25 * rng.normal(size=(n, 6))

pca6 = pca_fit(X6, n_components=6)
ratios = pca6.explained_variance_ratio_
cum = np.cumsum(ratios)

pcs = [f"PC{i+1}" for i in range(len(ratios))]

fig = go.Figure()
fig.add_trace(go.Bar(x=pcs, y=ratios, name="explained variance"))
fig.add_trace(go.Scatter(x=pcs, y=cum, mode="lines+markers", name="cumulative"))
fig.update_layout(
    title="Explained variance ratio",
    yaxis_title="fraction of total variance",
)
fig.update_yaxes(tickformat=".0%")
fig.show()
# 5.3 2D → 1D projection animation (onto PC1)

Xc = X2 - pca2.mean_
z1 = Xc @ pca2.components_[0]
X_proj = pca2.mean_ + np.outer(z1, pca2.components_[0])

# A line representing PC1
pc1_dir = pca2.components_[0]
span = 4.0 * float(np.sqrt(pca2.explained_variance_[0]))
line_a, line_b = pca2.mean_ - span * pc1_dir, pca2.mean_ + span * pc1_dir

x_min = float(min(X2[:, 0].min(), X_proj[:, 0].min()) - 0.5)
x_max = float(max(X2[:, 0].max(), X_proj[:, 0].max()) + 0.5)
y_min = float(min(X2[:, 1].min(), X_proj[:, 1].min()) - 0.5)
y_max = float(max(X2[:, 1].max(), X_proj[:, 1].max()) + 0.5)

ts = np.linspace(0.0, 1.0, 21)
frames = []
for t in ts:
    Xt = (1 - t) * X2 + t * X_proj
    frames.append(
        go.Frame(
            data=[go.Scatter(x=Xt[:, 0], y=Xt[:, 1])],
            name=f"t={t:.2f}",
            traces=[0],
        )
    )

fig = go.Figure(
    data=[
        go.Scatter(
            x=X2[:, 0],
            y=X2[:, 1],
            mode="markers",
            marker=dict(size=7, opacity=0.75, line=dict(width=0.5, color="white")),
            name="points",
        ),
        go.Scatter(
            x=[line_a[0], line_b[0]],
            y=[line_a[1], line_b[1]],
            mode="lines",
            name="PC1 axis",
        ),
    ],
    frames=frames,
)

fig.update_layout(
    title="Projecting 2D points down to 1D (onto PC1)",
    xaxis_title="x1",
    yaxis_title="x2",
    updatemenus=[
        {
            "type": "buttons",
            "showactive": False,
            "buttons": [
                {
                    "label": "Play",
                    "method": "animate",
                    "args": [None, {"frame": {"duration": 80, "redraw": True}, "fromcurrent": True}],
                },
                {
                    "label": "Pause",
                    "method": "animate",
                    "args": [[None], {"frame": {"duration": 0, "redraw": False}, "mode": "immediate"}],
                },
            ],
        }
    ],
    sliders=[
        {
            "steps": [
                {
                    "method": "animate",
                    "args": [[f"t={t:.2f}"], {"mode": "immediate", "frame": {"duration": 0, "redraw": True}}],
                    "label": f"{t:.2f}",
                }
                for t in ts
            ],
            "currentvalue": {"prefix": "t = "},
        }
    ],
)

fig.update_xaxes(range=[x_min, x_max])
fig.update_yaxes(range=[y_min, y_max], scaleanchor="x", scaleratio=1)

fig.show()
# 5.4 Linear PCA vs Kernel PCA on a non-linear dataset (two moons)

X_moons, y_moons = make_moons(n_samples=500, noise=0.12, random_state=7)
Xs = StandardScaler().fit_transform(X_moons)

Z_lin = PCA(n_components=2).fit_transform(Xs)
Z_k = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(Xs)

colors = {0: "#1f77b4", 1: "#d62728"}

fig = make_subplots(
    rows=1,
    cols=3,
    subplot_titles=[
        "Original space (2D)",
        "Linear PCA embedding",
        "Kernel PCA embedding (RBF)",
    ],
)

for label in [0, 1]:
    m = y_moons == label

    fig.add_trace(
        go.Scatter(
            x=Xs[m, 0],
            y=Xs[m, 1],
            mode="markers",
            marker=dict(size=6, color=colors[label], opacity=0.75),
            name=f"class {label}",
            showlegend=True,
        ),
        row=1,
        col=1,
    )

    fig.add_trace(
        go.Scatter(
            x=Z_lin[m, 0],
            y=Z_lin[m, 1],
            mode="markers",
            marker=dict(size=6, color=colors[label], opacity=0.75),
            name=f"class {label}",
            showlegend=False,
        ),
        row=1,
        col=2,
    )

    fig.add_trace(
        go.Scatter(
            x=Z_k[m, 0],
            y=Z_k[m, 1],
            mode="markers",
            marker=dict(size=6, color=colors[label], opacity=0.75),
            name=f"class {label}",
            showlegend=False,
        ),
        row=1,
        col=3,
    )

fig.update_xaxes(title_text="x1 (scaled)", row=1, col=1)
fig.update_yaxes(title_text="x2 (scaled)", row=1, col=1)

fig.update_xaxes(title_text="PC1", row=1, col=2)
fig.update_yaxes(title_text="PC2", row=1, col=2)

fig.update_xaxes(title_text="KPCA1", row=1, col=3)
fig.update_yaxes(title_text="KPCA2", row=1, col=3)

fig.update_layout(
    title="Linear vs Kernel PCA: kernel can 'unfold' non-linear structure",
    legend_title_text="",
)

fig.show()

6) When PCA works / fails#

When PCA tends to work well#

  • You believe the “signal” is in large-variance directions.

  • Features are roughly on the same scale (or you standardize).

  • You want a quick baseline for compression/visualization.

Common failure modes#

Noise sensitivity#

  • PCA is variance-hungry: if noise creates variance, PCA may chase it.

  • In very noisy settings, PC1 can become “the noisiest direction”, not “the most meaningful direction”.

Scaling issues#

  • PCA is not unitless — it depends on variance.

  • If one feature is measured in dollars and another in cents, the dollar feature can dominate.

  • Fix: standardize (StandardScaler) or use a domain-appropriate scaling.

Linear limitation#

  • PCA can only find linear directions.

  • If the data lies on a curve/manifold (moons, circles), linear PCA won’t separate the structure.

  • Fix: Kernel PCA (or other non-linear methods like UMAP/t-SNE for visualization).


Practical checklist#

  • Always center your data.

  • Decide whether you need scaling:

    • yes for mixed units / mixed scales

    • sometimes no if scale itself is meaningful

  • Pick n_components using explained variance and downstream performance.

  • Don’t over-interpret PCs as “real factors” without domain validation.
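One way to pick n_components in practice is to look at reconstruction error as a function of \(k\). A small sketch on synthetic data (rank-2 signal plus noise, so we know the "right" answer): the error falls sharply until \(k\) reaches the true rank, then flattens at the noise floor.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Rank-2 signal embedded in 8 dimensions, plus a little isotropic noise
latent = rng.normal(size=(300, 2)) * np.array([3.0, 1.5])
W = rng.normal(size=(2, 8))
X = latent @ W + 0.1 * rng.normal(size=(300, 8))

errors = []
for k in range(1, 9):
    pca = PCA(n_components=k)
    X_rec = pca.inverse_transform(pca.fit_transform(X))
    errors.append(float(np.mean((X - X_rec) ** 2)))

# Error drops steeply up to the true rank (2), then flattens;
# with all 8 components the reconstruction is exact up to float error
print(np.round(errors, 4))
```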

# A quick scaling demo: one feature has a much larger numeric scale

n = 400
x1 = rng.normal(0, 1.0, size=n)
# Same underlying signal, but scaled by 50x
x2 = 50.0 * (0.8 * x1 + rng.normal(0, 0.4, size=n))
X_scale = np.c_[x1, x2]

p_unscaled = pca_fit(X_scale, n_components=2)

Xs_scaled = StandardScaler().fit_transform(X_scale)
p_scaled = pca_fit(Xs_scaled, n_components=2)

print("Explained variance ratio (unscaled):", p_unscaled.explained_variance_ratio_)
print("Explained variance ratio (scaled):  ", p_scaled.explained_variance_ratio_)
Explained variance ratio (unscaled): [0.9999 0.0001]
Explained variance ratio (scaled):   [0.9427 0.0573]

7) Comparison table: PCA vs Kernel PCA vs ICA#

| Method | What it tries to do | Linear? | What “components” mean | Typical uses | Key gotchas |
|---|---|---|---|---|---|
| PCA | Maximize variance with orthogonal directions | ✅ | Orthogonal axes capturing decreasing variance | Compression, denoising, visualization, preprocessing | Sensitive to scaling; can chase noise; linear only |
| Kernel PCA | PCA in an implicit feature space via a kernel | ❌ (can be non-linear) | Non-linear directions in feature space (via kernel eigenvectors) | Unfold curved structure; non-linear embeddings | Kernel/gamma choice matters; scaling still matters; can be expensive |
| ICA | Separate statistically independent sources | ✅ (typically) | Directions that make components as independent as possible | Blind source separation (signals), artifact removal | More assumptions; components not ordered by variance; sign/scale ambiguities |


Exercises#

  1. Implement PCA via SVD and verify it matches the covariance-eigendecomposition approach.

  2. On the moons dataset, sweep gamma in Kernel PCA and observe when the embedding “unfolds” vs when it becomes noisy.

  3. Create a dataset where the highest variance direction is pure noise — see PCA fail on purpose.

References#

  • ESL (Hastie, Tibshirani, Friedman): The Elements of Statistical Learning

  • Bishop: Pattern Recognition and Machine Learning

  • Schölkopf, Smola, Müller (1998): Nonlinear Component Analysis as a Kernel Eigenvalue Problem (Kernel PCA)

  • sklearn docs: PCA, KernelPCA, FastICA